library(ggfortify)
library(mosaic)
library(mosaicData)
library(tidyverse)
library(janitor)
library(GGally)
diamonds <- read_csv("diamonds.csv") %>% clean_names()
head(diamonds)
We expect the carat of the diamonds to be strongly correlated with the physical dimensions x, y and z. Use ggpairs() to investigate correlations between these four variables.
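As a numeric companion to the ggpairs() plot, the pairwise correlations can also be computed directly with cor(). The sketch below runs on a small simulated stand-in for the data: the simulated values and the constants used to generate them are assumptions for illustration only, not the real dataset. With the real data the same idea applies to the carat, x, y and z columns.

```r
# Simulated stand-in for the four columns of interest (illustrative only).
set.seed(1)
n <- 500
x <- runif(n, min = 4, max = 9)                     # one physical dimension
y <- x + rnorm(n, sd = 0.05)                        # second dimension tracks the first
z <- 0.62 * x + rnorm(n, sd = 0.05)                 # third dimension as a fraction of x
carat <- 0.0061 * x * y * z + rnorm(n, sd = 0.02)   # weight scales with volume
toy <- data.frame(carat, x, y, z)

# Pearson correlation matrix for the four variables.
cor_mat <- cor(toy)
print(round(cor_mat, 3))

# With the real data and the packages loaded above, the plot would be e.g.:
# diamonds %>% select(carat, x, y, z) %>% ggpairs()
```

Correlations close to 1 between carat and each dimension would indicate that x, y and z carry little information beyond carat itself.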
alias(lm(carat ~ ., data = diamonds))
Model :
carat ~ x1 + cut + color + clarity + depth + table + price +
x + y + z
plotModel(model1)
Error: Problem with `mutate()` input `.color`.
x $ operator is invalid for atomic vectors
ℹ Input `.color` is `my_interaction(x$data[, intersect(discreteVars, restVars), drop = FALSE])`.
Run `rlang::last_error()` to see where the error occurred.
So, we do find significant correlations. Let’s drop columns x, y and z from the dataset, in preparation to use only carat going forward.
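The code that drops the columns is not shown here. A minimal sketch follows, assuming the result is stored under the name diamonds_trim (the name used by the later model calls); the toy data frame is made up purely so the snippet runs on its own.

```r
# Toy frame with the same column names as the cleaned dataset (values are made up).
diamonds <- data.frame(
  carat = c(0.23, 0.31, 0.70),
  price = c(326, 335, 2757),
  x = c(3.95, 4.34, 5.70),
  y = c(3.98, 4.35, 5.72),
  z = c(2.43, 2.75, 3.57)
)

# Keep every column except the three physical dimensions.
diamonds_trim <- diamonds[, !(names(diamonds) %in% c("x", "y", "z"))]
print(names(diamonds_trim))

# dplyr equivalent, given the tidyverse loaded above:
# diamonds_trim <- diamonds %>% select(-c(x, y, z))
```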
We are interested in developing a regression model for the price of a diamond in terms of the possible predictor variables in the dataset.
Use ggpairs() to investigate correlations between price and the predictors (this may take a while to run, don’t worry, make coffee or something).
Perform further ggplot visualisations of any significant correlations you find.
ggpairs(diamonds_trim)
model2 <- lm(price ~ + carat, data = diamonds_trim)
autoplot(model2)
plotModel(model2)
summary(model2)
Call:
lm(formula = price ~ +carat, data = diamonds_trim)
Residuals:
Min 1Q Median 3Q Max
-18585.3 -804.8 -18.9 537.4 12731.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2256.36 13.06 -172.8 <2e-16 ***
carat 7756.43 14.07 551.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1549 on 53938 degrees of freedom
Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
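To make the model2 coefficients concrete, the fitted line can be evaluated by hand. The sketch below copies the two estimates from the summary output above; the helper name predict_price is hypothetical and only stands in for what predict() would do with the fitted object.

```r
# Coefficients copied from the summary(model2) output above.
intercept <- -2256.36
slope_carat <- 7756.43

# Fitted price under price = intercept + slope * carat.
predict_price <- function(carat) intercept + slope_carat * carat

print(predict_price(c(0.25, 0.5, 1, 2)))

# With the fitted object available, the equivalent call would be:
# predict(model2, newdata = data.frame(carat = c(0.25, 0.5, 1, 2)))
```

A 1-carat diamond comes out at about 5500. Because the intercept is negative, the line predicts negative prices below roughly 0.29 carat, which is one sign that a straight line in carat alone is a rough approximation at the low end.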
model3 <- lm(price ~ + carat + cut, data = diamonds_trim)
autoplot(model3)
plotModel(model3)
summary(model3)
Call:
lm(formula = price ~ +carat + cut, data = diamonds_trim)
Residuals:
Min 1Q Median 3Q Max
-17540.7 -791.6 -37.6 522.1 12721.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3875.47 40.41 -95.91 <2e-16 ***
carat 7871.08 13.98 563.04 <2e-16 ***
cutGood 1120.33 43.50 25.75 <2e-16 ***
cutIdeal 1800.92 39.34 45.77 <2e-16 ***
cutPremium 1439.08 39.87 36.10 <2e-16 ***
cutVery Good 1510.14 40.24 37.53 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1511 on 53934 degrees of freedom
Multiple R-squared: 0.8565, Adjusted R-squared: 0.8565
F-statistic: 6.437e+04 on 5 and 53934 DF, p-value: < 2.2e-16
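The cut coefficients in model3 are offsets relative to a reference level that has no row of its own in the table. The sketch below evaluates the fitted equation by hand, assuming the reference level is Fair (the usual grade missing from the list of coefficients; the output itself does not name it). The helper fitted_price is hypothetical.

```r
# Coefficients copied from the summary(model3) output above.
coefs <- c(
  "(Intercept)"  = -3875.47,
  "carat"        =  7871.08,
  "cutGood"      =  1120.33,
  "cutIdeal"     =  1800.92,
  "cutPremium"   =  1439.08,
  "cutVery Good" =  1510.14
)

# Fitted price: intercept + carat effect + offset for the cut level.
# The assumed reference level (Fair) has no coefficient, so its offset is 0.
fitted_price <- function(carat, cut) {
  offset <- if (cut == "Fair") 0 else coefs[[paste0("cut", cut)]]
  coefs[["(Intercept)"]] + coefs[["carat"]] * carat + offset
}

print(fitted_price(1, "Fair"))   # reference cut
print(fitted_price(1, "Ideal"))  # reference plus the cutIdeal offset
```

The difference between the two printed values is exactly the cutIdeal estimate: at a fixed carat, the model prices an Ideal cut about 1801 above the reference cut.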
model4 <- lm(price ~ + carat + cut + color, data = diamonds_trim)
autoplot(model4)
plotModel(model4)
summary(model4)
Call:
lm(formula = price ~ +carat + cut + color, data = diamonds_trim)
Residuals:
Min 1Q Median 3Q Max
-17313.9 -751.2 -83.9 543.6 12273.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3760.05 41.32 -90.994 < 2e-16 ***
carat 8183.74 13.90 588.885 < 2e-16 ***
cutGood 1126.98 41.23 27.336 < 2e-16 ***
cutIdeal 1808.04 37.29 48.486 < 2e-16 ***
cutPremium 1442.73 37.78 38.189 < 2e-16 ***
cutVery Good 1518.00 38.14 39.804 < 2e-16 ***
colorE -90.65 22.63 -4.005 6.20e-05 ***
colorF -71.72 22.78 -3.148 0.00164 **
colorG -103.62 22.07 -4.694 2.68e-06 ***
colorH -732.17 23.71 -30.883 < 2e-16 ***
colorI -1075.68 26.58 -40.464 < 2e-16 ***
colorJ -1908.56 32.87 -58.055 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1432 on 53928 degrees of freedom
Multiple R-squared: 0.8711, Adjusted R-squared: 0.8711
F-statistic: 3.315e+04 on 11 and 53928 DF, p-value: < 2.2e-16
model6 <- lm(price ~ + carat + cut + color + clarity, data = diamonds_trim)
autoplot(model6)
plotModel(model6)
summary(model6)
Call:
lm(formula = price ~ +carat + cut + color + clarity, data = diamonds_trim)
Residuals:
Min 1Q Median 3Q Max
-16813.5 -680.4 -197.6 466.4 10394.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7362.80 51.68 -142.46 <2e-16 ***
carat 8886.13 12.03 738.44 <2e-16 ***
cutGood 655.77 33.63 19.50 <2e-16 ***
cutIdeal 998.25 30.66 32.56 <2e-16 ***
cutPremium 869.40 30.93 28.11 <2e-16 ***
cutVery Good 848.72 31.28 27.14 <2e-16 ***
colorE -211.68 18.32 -11.56 <2e-16 ***
colorF -303.31 18.51 -16.39 <2e-16 ***
colorG -506.20 18.12 -27.93 <2e-16 ***
colorH -978.70 19.27 -50.78 <2e-16 ***
colorI -1440.30 21.65 -66.54 <2e-16 ***
colorJ -2325.22 26.72 -87.01 <2e-16 ***
clarityIF 5419.65 52.14 103.95 <2e-16 ***
claritySI1 3573.69 44.60 80.13 <2e-16 ***
claritySI2 2625.95 44.79 58.63 <2e-16 ***
clarityVS1 4534.88 45.54 99.59 <2e-16 ***
clarityVS2 4217.83 44.84 94.06 <2e-16 ***
clarityVVS1 5072.03 48.21 105.20 <2e-16 ***
clarityVVS2 4967.20 46.89 105.93 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1157 on 53921 degrees of freedom
Multiple R-squared: 0.9159, Adjusted R-squared: 0.9159
F-statistic: 3.264e+04 on 18 and 53921 DF, p-value: < 2.2e-16
model7 <- lm(price ~ + carat + cut + color + clarity + depth, data = diamonds_trim)
autoplot(model7)
plotModel(model7)
summary(model7)
Call:
lm(formula = price ~ +carat + cut + color + clarity + depth,
data = diamonds_trim)
Residuals:
Min 1Q Median 3Q Max
-16805.0 -680.3 -197.9 466.2 10393.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6902.043 245.309 -28.136 <2e-16 ***
carat 8885.816 12.034 738.362 <2e-16 ***
cutGood 644.141 34.173 18.849 <2e-16 ***
cutIdeal 982.153 31.780 30.905 <2e-16 ***
cutPremium 849.908 32.551 26.110 <2e-16 ***
cutVery Good 833.287 32.291 25.806 <2e-16 ***
colorE -211.836 18.316 -11.566 <2e-16 ***
colorF -303.274 18.509 -16.385 <2e-16 ***
colorG -505.360 18.127 -27.879 <2e-16 ***
colorH -977.533 19.281 -50.699 <2e-16 ***
colorI -1439.080 21.655 -66.455 <2e-16 ***
colorJ -2323.871 26.731 -86.935 <2e-16 ***
clarityIF 5415.009 52.191 103.754 <2e-16 ***
claritySI1 3571.383 44.613 80.053 <2e-16 ***
claritySI2 2623.014 44.813 58.532 <2e-16 ***
clarityVS1 4531.387 45.570 99.437 <2e-16 ***
clarityVS2 4214.967 44.865 93.948 <2e-16 ***
clarityVVS1 5068.355 48.248 105.049 <2e-16 ***
clarityVVS2 4963.722 46.924 105.781 <2e-16 ***
depth -7.160 3.727 -1.921 0.0547 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1157 on 53920 degrees of freedom
Multiple R-squared: 0.9159, Adjusted R-squared: 0.9159
F-statistic: 3.092e+04 on 19 and 53920 DF, p-value: < 2.2e-16
model8 <- lm(price ~ + carat + cut + color + clarity + depth + table, data = diamonds_trim)
autoplot(model8)
plotModel(model8)
summary(model8)
Call:
lm(formula = price ~ +carat + cut + color + clarity + depth +
table, data = diamonds_trim)
Residuals:
Min 1Q Median 3Q Max
-16828.8 -678.7 -199.4 464.6 10341.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4555.171 373.482 -12.197 < 2e-16 ***
carat 8895.194 12.079 736.390 < 2e-16 ***
cutGood 614.424 34.337 17.894 < 2e-16 ***
cutIdeal 877.569 34.152 25.696 < 2e-16 ***
cutPremium 806.024 32.954 24.459 < 2e-16 ***
cutVery Good 778.428 32.936 23.635 < 2e-16 ***
colorE -210.849 18.304 -11.519 < 2e-16 ***
colorF -304.288 18.498 -16.450 < 2e-16 ***
colorG -506.964 18.116 -27.984 < 2e-16 ***
colorH -977.974 19.269 -50.754 < 2e-16 ***
colorI -1438.277 21.642 -66.459 < 2e-16 ***
colorJ -2322.565 26.715 -86.940 < 2e-16 ***
clarityIF 5404.237 52.174 103.582 < 2e-16 ***
claritySI1 3567.794 44.587 80.020 < 2e-16 ***
claritySI2 2619.004 44.788 58.476 < 2e-16 ***
clarityVS1 4525.400 45.547 99.356 < 2e-16 ***
clarityVS2 4210.194 44.840 93.893 < 2e-16 ***
clarityVVS1 5061.734 48.224 104.964 < 2e-16 ***
clarityVVS2 4957.310 46.901 105.697 < 2e-16 ***
depth -21.024 4.079 -5.154 2.56e-07 ***
table -24.803 2.978 -8.329 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1156 on 53919 degrees of freedom
Multiple R-squared: 0.9161, Adjusted R-squared: 0.916
F-statistic: 2.942e+04 on 20 and 53919 DF, p-value: < 2.2e-16
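To compare the fits side by side, the adjusted R-squared values reported above can be collected into one table. This is only a bookkeeping sketch: the numbers are copied from the summary() outputs, and the column names are illustrative.

```r
# Adjusted R-squared values copied from the summary() outputs above,
# labelled by the term added at each step.
fits <- data.frame(
  added_term = c("carat", "cut", "color", "clarity", "depth", "table"),
  adj_r2     = c(0.8493, 0.8565, 0.8711, 0.9159, 0.9159, 0.916),
  stringsAsFactors = FALSE
)

# Gain in adjusted R-squared relative to the previous, smaller model.
fits$gain <- c(NA, diff(fits$adj_r2))
print(fits)

# With the fitted objects available, nested models can also be compared
# with an F-test, e.g. anova(model6, model7).
```

After carat, clarity gives the largest improvement, while depth and table barely move adjusted R-squared even though table (and depth, once table is included) is highly significant. With roughly 54,000 observations, very small effects can reach significance, so the size of the gain is arguably the more useful guide here.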
Shortly we may try a regression fit using one or more of the categorical predictors cut, clarity and color, so let’s investigate these predictors:
Investigate the factor levels of these predictors. How many dummy variables do you expect for each of them?
Use the dummy_cols() function in the fastDummies package to generate dummies for these predictors and check the number of dummies in each case.
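A predictor with k levels needs k - 1 dummy variables once one level is kept as the reference. The lm() outputs above are consistent with this: they show 4 cut, 6 color and 7 clarity coefficients, which suggests 5, 7 and 8 levels respectively. The sketch below checks the counting rule on a made-up toy data frame (it deliberately does not contain every level of the real data) and cross-checks it with fastDummies::dummy_cols() only if that package is installed.

```r
# Toy data with the three categorical predictors (values are made up).
toy <- data.frame(
  cut     = c("Fair", "Good", "Ideal", "Good"),
  color   = c("D", "E", "D", "J"),
  clarity = c("SI1", "VS2", "SI1", "IF"),
  stringsAsFactors = TRUE
)

# Number of levels per predictor, and dummies expected once one level
# is dropped as the reference (levels - 1).
n_levels <- sapply(toy, nlevels)
expected_dummies <- n_levels - 1
print(rbind(n_levels, expected_dummies))

# Cross-check with fastDummies if it is installed; remove_first_dummy = TRUE
# drops the first level of each column so that it acts as the reference.
if (requireNamespace("fastDummies", quietly = TRUE)) {
  dummied <- fastDummies::dummy_cols(toy, remove_first_dummy = TRUE)
  n_created <- ncol(dummied) - ncol(toy)
  print(n_created == sum(expected_dummies))
}
```

On the real data, the same two steps would be applied to the cut, color and clarity columns of the diamonds data frame. Without remove_first_dummy = TRUE, dummy_cols() creates one column per level, so the count would be k rather than k - 1.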